TF-IDF and LSA

Term Frequency - Inverse Document Frequency (TF-IDF) estimates the importance of each word in a document: a word scores higher the more often it appears in that document and the rarer it is across the rest of the corpus.
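
Roughly, tf-idf(t, d) = tf(t, d) × idf(t). A quick sketch of the idf part, assuming scikit-learn's smoothed formula (TfidfVectorizer's default), using a 4-document corpus like the one below:

In [ ]:
import math

# Sketch of scikit-learn's default smoothed idf (assuming smooth_idf=True):
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = 4
idf_everywhere = math.log((1 + n_docs) / (1 + 4)) + 1  # term in all 4 docs -> 1.0
idf_rare = math.log((1 + n_docs) / (1 + 1)) + 1        # term in 1 doc -> ~1.92

The rarer the term, the larger its idf, so corpus-wide words like "the" get down-weighted.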


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first sentence',
    'This is the second sentence',
    'Something that has nothing to do with the others',
    'This is the final sentence'
]

In [ ]:
vectorizer = TfidfVectorizer(use_idf=True)  # use_idf=True is the default; shown explicitly
vectors = vectorizer.fit_transform(corpus)

In [38]:
vectors.shape  # 4 documents x 15 unique vocabulary terms


Out[38]:
(4, 15)

In [48]:
import pandas as pd

# One row per vocabulary term, one column per document
df = pd.DataFrame(index=vectorizer.get_feature_names_out())

for i, doc in enumerate(vectors):
    df[f"sentence {i+1}"] = doc.toarray().ravel()

df


Out[48]:
sentence 1 sentence 2 sentence 3 sentence 4
do 0.000000 0.000000 0.347685 0.000000
final 0.000000 0.000000 0.000000 0.633146
first 0.633146 0.000000 0.000000 0.000000
has 0.000000 0.000000 0.347685 0.000000
is 0.404129 0.404129 0.000000 0.404129
nothing 0.000000 0.000000 0.347685 0.000000
others 0.000000 0.000000 0.347685 0.000000
second 0.000000 0.633146 0.000000 0.000000
sentence 0.404129 0.404129 0.000000 0.404129
something 0.000000 0.000000 0.347685 0.000000
that 0.000000 0.000000 0.347685 0.000000
the 0.330402 0.330402 0.181437 0.330402
this 0.404129 0.404129 0.000000 0.404129
to 0.000000 0.000000 0.347685 0.000000
with 0.000000 0.000000 0.347685 0.000000

LSA

TF-IDF limitation: two documents about the same subject that use different words (e.g. synonyms) share no terms, so their TF-IDF vectors end up not being related at all.
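
A quick way to see this (a hypothetical two-sentence example; the sentences mean roughly the same thing but share no vocabulary):

In [ ]:
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences about the same thing with zero vocabulary overlap
pair = ['the car is fast', 'that automobile was quick']
pair_vectors = TfidfVectorizer().fit_transform(pair)
cosine_similarity(pair_vectors[0], pair_vectors[1])  # [[0.]] -- "unrelated"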

Latent Semantic Analysis (LSA) compresses TF-IDF word scores into low-dimensional "topic vectors" (via a truncated SVD of the TF-IDF matrix). This enables semantic search: querying documents by their meaning, or finding similar documents. Topic vectors can be added and subtracted to learn new things, and vectors pointing in similar directions end up having similar meanings!

LSA is unsupervised
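
A minimal LSA sketch with scikit-learn's TruncatedSVD (n_components=2 is an arbitrary choice for this tiny corpus):

In [ ]:
from sklearn.decomposition import TruncatedSVD

# Truncated SVD of the TF-IDF matrix yields one low-dimensional
# "topic vector" per document
svd = TruncatedSVD(n_components=2)
topic_vectors = svd.fit_transform(vectors)
topic_vectors.shape  # (4, 2): a 2-dimensional topic vector per sentence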

LDA and LDiA

Linear Discriminant Analysis (LDA) creates a single topic vector per document. Just compute the centroid of the TF-IDF vectors for each class in a binary classification problem (spam vs. non-spam) and find the line between the two centroids. How far along that line a TF-IDF vector's projection falls tells you which class it belongs to.

LDA is a supervised algorithm that needs labels for training.
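
To make the centroid picture concrete, here is a hand-rolled sketch of the idea (an illustration, not scikit-learn's actual solver):

In [ ]:
import numpy as np

X_dense = vectors.toarray()
centroid_a = X_dense[[0, 1, 3]].mean(axis=0)  # centroid of the three similar sentences
centroid_b = X_dense[[2]].mean(axis=0)        # "centroid" of the odd one out
line = centroid_a - centroid_b                # direction from class b to class a

# Project each document onto the line, measured from the midpoint:
# positive values fall on class a's side, negative on class b's
np.dot(X_dense - (centroid_a + centroid_b) / 2, line)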

Latent Dirichlet Allocation (LDiA) is unsupervised and can create multiple topic vectors per document (see the sketch at the end of this section).


In [82]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = vectors.toarray()
# Label the three similar sentences as class 10 and the odd one out as class 0
y = [10, 10, 0, 10]

clf = LDA()
clf.fit(X, y)


Out[82]:
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [83]:
# Predict the class of an all-zeros 15-dimensional vector
clf.predict([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]])


Out[83]:
array([10])

In [85]:
# Predict the class of sentence 3's TF-IDF vector (df column index 2)
clf.predict([df.iloc[:, 2].values])


Out[85]:
array([10])

In [86]:
# The raw TF-IDF vector for sentence 3
df.iloc[:, 2].values


Out[86]:
array([0.34768534, 0.        , 0.        , 0.34768534, 0.        ,
       0.34768534, 0.34768534, 0.        , 0.        , 0.34768534,
       0.34768534, 0.18143663, 0.        , 0.34768534, 0.34768534])

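Back to LDiA: a minimal sketch using scikit-learn's LatentDirichletAllocation. Note that it works on raw term counts rather than TF-IDF scores, and n_components=2 is an arbitrary choice here:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDiA models each document as a mixture of topics over raw word counts
counts = CountVectorizer().fit_transform(corpus)
ldia = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = ldia.fit_transform(counts)
doc_topics.round(2)  # one row per sentence: its mixture over the 2 topics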